Base Map with rnaturalearth Package

To create maps of the world (or of portions of the world), we will use the rnaturalearth package. It provides similar functionality to the albersusa package, but it provides boundaries of countries around the world rather than boundaries of states or counties within the United States. Like albersusa, it produces objects of class sf (“simple feature”). We’ll also use plotly to add interactivity to our maps. We begin by loading packages:

library(tidyverse)
library(plotly)
library("rnaturalearth")
library("rnaturalearthdata")

The following command creates a simple feature dataset with the ploygon data for the boundaries of all countries in the world:

world <- ne_countries(scale = "medium", returnclass = "sf")

Just like we did with U.S. states, we can now create a world map using ggplot() with geom_sf():

ggplot(world) +
  geom_sf() 

This creates our base map. We are now ready to use a dataset to add color fill to each country on the map.

Example: Creating a World Map with the TB Data

For this example we will return to the World Health Organization Tuberculosis data from the tidyr package. Let’s create a world choropleth map with countries colored by the rates of new tuberculosis diagnoses in 2013. We begin by loading and tidying the data, using the commands developed in a previous handout:

data(who)
who_tidy <- who %>%
  pivot_longer(new_sp_m014:newrel_f65, names_to = "groups", values_to = "cases") %>%
  mutate(groups = str_replace(groups, "newrel", "new_rel"),
         country = str_replace(country, "Cote d'Ivoire", "Côte d'Ivoire"), 
         country = str_replace(country, "Curacao", "Curaçao")) %>%
  separate(groups, c("new", "type", "sexage"), sep="_") %>%
  select(-new, -iso2) %>%
  separate(sexage, c("sex", "age"), sep = 1) %>%
  left_join(population, by = c("country" = "country", "year" = "year"))

Next, we compute TB rates for each country.

TB_rates <- who_tidy %>%
  filter(year == 2013) %>%
  group_by(country) %>%
  summarize(TB_rate = sum(cases, na.rm=TRUE)/median(population, na.rm=T))

Next we need to join the TB rate date with the map data. The names of countries in the TB_rates data are contained in a variable named country. For the map data, there are numerous columns with what appear to be different versions of names of each country. Let’s examine them somewhat more closely by looking at world in the data viewer.

  • The first such variable, sovereignt, is not the correct variable to use. Note for example, that “United States of America” is listed in roughly the 10th row. However, scrolling to the right, we see that this row is actually American Samoa (which is a U.S. territory). The TB data has a separate row for American Samoa, so we do not want to match the U.S. data to this row. To see a bit more detail of what is happening with this variable, click on the “Filter” button at the top of the data viewer pane. In the box at the top of the sovereignt column, type “United”. You will see that there are multiple rows for both the United States and United Kingdom. For the United States, there are six different rows, including American Samoa, Guam, Northern Mariana Islands, Puerto Rico, United States, and U.S. Virgin Islands. In the TB data, all six of these have their own separate rows. Thus sovereignt cannot be used to uniquely match up rows from the two datasets.
  • Skimming through the columns of the dataset in the viewer, some other possible variables that might be useful for identifying names of countries include admin, name, name_long, or geounit (there are others as well, but I’ll focus on these for now). Let’s take a look at some values of each of these, focusing on rows where two of them (admin and name_long) do not agree.
filter(world, admin != name_long) %>% select(admin, name, name_long, geounit) %>% View()
  • It is hard to know which of these might work best. Note, however, that name uses abbreviations in some cases, so it may not be the best choice. Both name and name_long include some special characters like accents rather than just plain text. Our TB data does not appear to include any special characters, so these variables may not be the best choices.
  • The variables admin and geounit appear to be pretty similar. Let’s just pick one (I’ll use admin) and give it a try.
  • As a final note, we can check the number of “unique” values in a column using a command like length(unique(world$sovereignt)), which tells us that there are only 200 distinct names in the 241 rows. As we’ve already noted, this suggests that it will not be a good variable for matching rows to another dataset. On the other hand, length(unique(world$admin)) indicates that all 241 rows are distinct.

Now let’s join the TB data with the map data, using admin from the map data and country from the TB data to match rows.

world_TB <- world %>%
  left_join(TB_rates, by = c("admin" = "country"))

With the data joined, let’s create a choropleth map in the same manner that we did with U.S. maps:

ggplot(world_TB) +
  geom_sf(aes(fill=TB_rate), color="black") +
  scale_fill_continuous(low="yellow", high="red", labels = scales::percent)

Something is wrong here – quite a few countries, such as Russia and Iran, are shaded gray due to missing data. A quick look at the TB_rates data (just run View(TB_rates)) indicates that we do have data on the TB rates for both of these countries, so something must have gone wrong in the merging of the map polygon data with the TB rate data. Note that in the original TB data (TB_rates) there were 217 rows, and all 217 had values for TB_rate, as shown by the command below.

sum(!is.na(TB_rates$TB_rate))
## [1] 217

However, when we merged with the map data, we only successfully matched 187 of those countries with their map data.

sum(!is.na(world_TB$TB_rate))
## [1] 187

Let’s take a look at our data used to produce the map and see which countries have missing values (NA) for the TB rate:

world_TB %>%
  filter(is.na(TB_rate)) %>%
  select(admin) %>%
  View()

We see that there are 54 countries in the map data which do not have data for TB rates. In some cases this is correct – for example, Antarctica is a “country” in the map data, but we have no TB data for Antarctica. On the other hand, in other cases, such as Russia and Iran, we do have data, but it was not matched properly in the join process. For example, Russia is “Russian Federation” in the TB data, but is just “Russia” in the map data, so they did not match up correctly in the join. Similarly, Iran is “Iran (Islamic Republic of)” in the TB data, but is just “Iran” in the map data. We could attempt to manually fix the mismatches, but with 54 unmatched countries, this would be quite tedious. We need a better approach to the matching.

Matching names of countries across different datasets is always a challenge since even a slight variation in the name (including abbreviations or accent marks) will make a direct match fail. This is the reason that the International Standards Organization developed country codes like “ISO3” to give consistent names/labels for countries that can be used across datasets. Thankfully, the ISO3 code for each country is included in both TB data (iso3) and the map data (iso_a3).

Our current version of the TB_rates data does not include iso3, so we will need to recreate it to include iso3. I will also keep the name of the country in order to have a more informative label for each row rather than just the ISO3 code. Note that when we use summarize() with group_by(), only the grouping variable (country) and any variables computed in summarize() are kept in the resulting data frame (that’s why iso3 was dropped in our original version of TB_rates). To keep both, I will include both as grouping variables in group_by(). In terms of the grouping this is redundant since each country has a unique ISO3 code, but it makes the resulting data frame include both variables.

TB_rates <- who_tidy %>%
  filter(year == 2013) %>%
  group_by(country, iso3) %>%
  summarize(TB_rate = sum(cases, na.rm=TRUE)/median(population, na.rm=T))

Now we use the ISO3 code to join the map data and the TB data:

world_TB <- world %>%
  left_join(TB_rates, by = c("iso_a3" = "iso3"))

Based on the command below, we see that this time we successfully matched the TB_rate for 214 of the 217 countries with actual values for TB_rate in the original TB data. This is a dramatic improvement over our first attempt.

sum(!is.na(world_TB$TB_rate))
## [1] 214

Let’s check which 3 countries from the TB data were not matched with map data. I’ll use the function anti_join(), which returns observations from the first dataset which do not have matches in the second dataset. Since I want to know the observations from TB_rates that do not have matches in the map data, I’ll put the TB_rates data first.

anti_join(TB_rates, world, by = c("iso3" = "iso_a3"))

A little additional exploration suggests that none of these three countries are included in the map data. A little searching on Google indicates that all three are tiny islands which are territories of other nations. Even if we could match them up, they are so small that they would be invisible on our map. We have done the best we can with the available data to match up TB data for countries with the map data, so we’ll now move on to finalize our map.

Let’s recreate my_map_theme() and then produce a polished version of our world map:

my_map_theme <- function(){
  theme(panel.background=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank())
}

ggplot(world_TB) +
  geom_sf(aes(fill=TB_rate*10000), color="black") +
  scale_fill_continuous("TB Rate\nper 10,000\nresidents", low="yellow", high="red") +
  my_map_theme() +
  labs(title = "Tuberculosis Rates by Country in 2013",
       subtitle = "Rate of new tuberculosis diagnoses in 2013\n(cases per 10,000 residents)") +
  theme(plot.title = element_text(hjust=0.5, size = 18)) +
  theme(plot.subtitle = element_text(hjust=0.5))

Regional Maps

We can also easily filter the data to look at particular regions of the world. There are a variety of variables in the map data which can facilitate this, including continent, region_wb (World Bank regions), region_un (United Nations regions), and subregion. To see the possible values for these variables, you can run a command like the following in the console:

unique(world$region_wb)
## [1] "Latin America & Caribbean"  "South Asia"                
## [3] "Sub-Saharan Africa"         "Europe & Central Asia"     
## [5] "Middle East & North Africa" "East Asia & Pacific"       
## [7] "Antarctica"                 "North America"

From the original world map, we see that many of the worst problems with tuberculosis appear to be in Sub-Saharan Africa. We can create a map that focuses only on this region by filtering as follows. Note that to create the filter, I simply copied the exact name of the region (“Sub-Saharan Africa”) from the output above.

world_TB %>%
  filter(region_wb=="Sub-Saharan Africa") %>%
  ggplot() +
  geom_sf(aes(fill=TB_rate*10000), color="black") +
  scale_fill_continuous("TB Rate\nper 10,000\nresidents", low="yellow", high="red") +
  my_map_theme() +
  labs(title = "Tuberculosis Rates by Country in 2013",
       subtitle = "Rate of new tuberculosis diagnoses in 2013\n(cases per 10,000 residents)") +
  theme(plot.title = element_text(hjust=0.5, size = 18)) +
  theme(plot.subtitle = element_text(hjust=0.5))

Interactive Maps with ggplotly

Finally, we can make our m,ap interactive through the use of ggplotly() from the plotly package. To use ggplotly(), we first create a graph or map using the standard functionality of ggplot(), and then we pass that plot as the first argument to ggplotly(). Note that some details of plots created in ggplot(), such as subtitles and captions and some details of themes, are not implemented in ggplotly(). These layers of a ggplot object will be ignored and dropped by ggplotly. Furthermore, plotly objects have a toolbar at the top which often ends up under the title when a title is included in a plot. As a result, I generally recommend not using titles, subtitles, or captions at all when working with ggplotly. An interactive ggplotly object is included as part of an HTML file, and annotations like titles and captions can be added on the HTML page above or below the plot rather than as a part of the plot object itself.

To create an interactive world map using the TB data, we’ll need to recreate our original map with a few modifications:

  • Use a mutate command to create a text label for each country that will pop up as a hover effect in the map. This new variable is added to the dataset before passing the data to ggplot().
  • Add “text” as an aesthetic in geom_sf(). This aesthetic will be ignored by ggplot(), but will later be used by ggplotly().
  • Remove title, subtitle, and caption.
p <- world_TB %>%
  mutate(text = paste("<b>",admin,"</b>\nTB rate:",signif(TB_rate*10000,3),
                      "per 10,000 residents")) %>%
  ggplot() +
    geom_sf(aes(fill=TB_rate*10000, text=text), color="black") +
    scale_fill_continuous("TB Rate\nper 10,000\nresidents", low="yellow", high="red") +
    my_map_theme() 
## Warning: Ignoring unknown aesthetics: text

The command above creates the plot, but does not display it. We now pass the plot to ggplotly(). The tooltip argument tells ggplotly() what information to display when we hover over objects in the plot. The style(hoveron = "fill") command tells it to display the text when we hover over the interior fill of a country rather than the boundary of the country.

ggplotly(p, tooltip = "text") %>%
  style(hoveron = "fill") 

Note that the toolbar at the top of the map allows us to zoom, pan, rescale, and explore the map in a variety of ways.

A Final Fix…

One odd issue appeared in the interactive map above. There are a several countries (including India, Thailand, Myanmar, and Papua New Guinea) for which the hover text for just says “Trace 1” rather than reporting any actual data. It turns out that all of these countries have 0 reported cases of Tuberculosis in the data (I’m not sure that I trust those numbers, but that’s a completely separate issue). Similarly, countries/regions with no data (like Antarctica and Taiwan) just say “Trace 191”. This is a result of a known issue in ggplotly(), and fixing the problem is a bit tricky.

The issue is that when multiple polygons in a map have identical values for the fill aesthetic, they are all sent to ggplotly() together as a single “trace” which gets the same fill (this is done to improve efficiency). When we subsequently tell it to provide “tooltips” based on hovering over the “fill”, there is no way for it to separate these countries from each other to provide the correct text label.

So how do we fix this? One way is to slightly change the values of TB_rate (the variable used for fill) so that none of them are exactly equal. A simple way to do this is to add a very small amount of random “noise” to each value that is so small that it will not affect the color or labels. Ignoring the 0’s, the rest of the TB rates per 10,000 residents range from a minimum of about 0.088 to a maximum of about 59.19. I’ll add a random number between 0 and 0.0001 to each of these values and then recreate the graph. I can do this using runif(), which generates random values from a uniform distribution.

One other minor change: I changed style(hoveron = "fill") to style(hoveron = "fills"). Using fill resulted in having Trace 191 pop up on hovering over these regions, but changing it to fills completely suppresses the labels on the countries/regions with NA values for TB_rate. [Full disclosure: I have no idea why it works this way. But it does…]

p <- world_TB %>%
  mutate(text = paste("<b>",admin,"</b>\nTB rate:",signif(TB_rate*10000,3),
                      "per 10,000 residents")) %>%
  ggplot() +
    geom_sf(aes(fill=TB_rate*10000+runif(nrow(world_TB), min=0, max=0.0001), 
                text=text), color="black") +
    scale_fill_continuous("TB Rate\nper 10,000\nresidents", low="yellow", high="red") +
    my_map_theme() 
## Warning: Ignoring unknown aesthetics: text
ggplotly(p, tooltip = "text") %>%
  style(hoveron = "fills")